Bioinformatics (Thomas Dandekar, Meik Kunz)

to annotate an unknown protein on the second BLAST run by including the unknown

sequences in my search on the first run, i.e. doing a position-specific iterated search

(which is what PSI stands for in PSI-BLAST). But if you want to do this well, you should make

many additional predictions, including looking at the structure, the localization, and the

domains; only then do you get quite good results and insights into the function (see also

the examples PROSITE and AnDom, and the work of Gaudermann P, Vogl I, Zientz E et al.

(2006) Analysis of and function predictions for previously conserved hypothetical or putative

proteins in Blochmannia floridanus. BMC Microbiol 6:1). If one wants to be

more precise, like the ENCODE consortium, and find all regulatory elements in a genome

(and not just the proteins or genes), then it is advisable to map out conserved regions by

comparison with closely related genomes and also to use motif search programs such as MEME

from the MEME Suite of motif-based sequence analysis tools (for this, read the paper

https://www.sdsc.edu/~tbailey/papers/meme.ml.pdf and refer to the web site

https://meme-suite.org/doc/meme.html).
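
To make the idea of a motif search more concrete, here is a minimal sketch in Python. It is not the MEME algorithm (MEME discovers unknown motifs by expectation maximization); it only illustrates the simpler step of scanning a sequence with a position weight matrix built from known sites, which is also the kind of position-specific matrix used in profile searches. The example sites, the test sequence, the uniform background of 0.25 per base and the pseudocount of 1 are all assumptions chosen for illustration.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        scores = {}
        for base in alphabet:
            # Pseudocounts avoid log(0) for bases never seen at this position.
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def scan(sequence, pwm):
    """Return (start, score) for every window of the sequence."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest log-odds score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


# Hypothetical aligned binding sites and test sequence (illustration only).
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
start, score = best_hit("GGCGTATAATGCC", pwm)
print(start, round(score, 2))
```

Windows that resemble the aligned sites receive a high log-odds score, while unrelated windows score around zero or below. Real tools additionally estimate the background from the data and assess statistical significance, which this sketch leaves out.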

The general-purpose software RepeatMasker (https://www.repeatmasker.org) is very handy for

identifying repetitive elements (recurring units). We have also developed our own server,

L1base, which finds LINE elements (long interspersed nuclear elements), i.e. large, repetitive,

selfish DNA sequences (https://line1.bioapps.biozentrum.uni-wuerzburg.de/; from here you are

redirected to the Charité page, https://l1base.charite.de, which shows the current further

development of the server and its documentation). Another possibility is to search for repeats in protein sequences, where

the tool REPRO (based on Smith-Waterman local alignment and subsequent iterative

clustering; https://www.ibi.vu.nl/programs/reprowww/) is very useful. Again, the

documentation on the website is recommended. Genome annotation then quickly becomes a

science in itself. For the human genome, relevant sites are already recommended in the

book chapter but are also mentioned here. The ENCODE entry page already mentioned also

19.1 Genomic Data: From Sequence to Structure and Function